{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "_uuid": "9e0b88084b8a7281f22f37208564a3a292f5693e"
   },
   "source": [
    "<center>\n",
    "<img src=\"https://habrastorage.org/files/fd4/502/43d/fd450243dd604b81b9713213a247aa20.jpg\">\n",
    "    \n",
    "## [mlcourse.ai](mlcourse.ai) – Open Machine Learning Course \n",
    "\n",
    "<center>Auteur: [Yury Kashnitskiy](http://yorko.github.io) <br>\n",
    "Traduit et édité par [Sergey Isaev](https://www.linkedin.com/in/isvforall/), [Artem Trunov](https://www.linkedin.com/in/datamove/), [Anastasia Manokhina](https://www.linkedin.com/in/anastasiamanokhina/), [Yuanyuan Pao](https://www.linkedin.com/in/yuanyuanpao/) et [Ousmane Cissé](https://github.com/oussou-dev)<br>Contenu soumis à la licence [Creative Commons CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "lang": "fr"
   },
   "source": [
    "# <center> Devoir #1 (démo)\n",
    "## <center>  Analyse exploratoire des données avec Pandas"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "lang": "fr"
   },
   "source": [
    "**Dans cette tâche, vous devez utiliser la librairie Pandas pour répondre à quelques questions liées au dataset (ensemble de données) [Adulte](https://archive.ics.uci.edu/ml/datasets/Adult). (Vous n'avez pas à télécharger les données - vous les trouverez ci-joints). Choisissez vos réponses via ce [formulaire](https://docs.google.com/forms/d/1uY7MpI2trKx6FLWZte0uVh3ULV4Cm_tDud0VDFGCOKg). C'est une version de démonstration. Ainsi, en soumettant le formulaire, vous verrez un lien vers le fichier .ipynb de la solution.**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "_uuid": "f9641558236566bc4515b86271f69b2934afe7a9"
   },
   "source": [
    "Valeurs uniques de toutes les caractéristiques (pour plus d'informations, veuillez consulter les liens ci-dessus):\n",
    "- `age`: continuous.\n",
    "- `workclass`: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.\n",
    "- `fnlwgt`: continuous.\n",
    "- `education`: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.\n",
    "- `education-num`: continuous.\n",
    "- `marital-status`: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.\n",
    "- `occupation`: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.\n",
    "- `relationship`: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.\n",
    "- `race`: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.\n",
    "- `sex`: Female, Male.\n",
    "- `capital-gain`: continuous.\n",
    "- `capital-loss`: continuous.\n",
    "- `hours-per-week`: continuous.\n",
    "- `native-country`: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.   \n",
    "- `salary`: >50K,<=50K"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "_uuid": "3570c8e98366e6e09f6e9e09bfef1385a74898a8"
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "_uuid": "00e99b05a3236a79b7177f4ddb63f8844c275f78"
   },
   "outputs": [],
   "source": [
    "data = pd.read_csv(\"../input/adult.data.csv\")\n",
    "data.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "lang": "fr"
   },
   "source": [
    "**1. Combien d'hommes et de femmes (* caractéristique sex * ) sont représentés dans ce dataset ?**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "_uuid": "b47a687d2ef3a449909c9212421d83e42c039bd8"
   },
   "outputs": [],
   "source": [
    "data[\"sex\"].value_counts()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "lang": "fr"
   },
   "source": [
    "**2. Quel est l’âge moyen des femmes?**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "_uuid": "6936751bc2061873c52c052b8977b1fe0ac826bd"
   },
   "outputs": [],
   "source": [
    "data.loc[data[\"sex\"] == \"Female\", \"age\"].mean()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "lang": "fr"
   },
   "source": [
    "**3. Quel est le pourcentage de citoyens allemands (caractéristique *native-country*)?**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "_uuid": "4edbe332b55c59e05da5574920b0f91283b7814a"
   },
   "outputs": [],
   "source": [
    "float((data[\"native-country\"] == \"Germany\").sum()) / data.shape[0]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "lang": "fr"
   },
   "source": [
    "**4-5. Quels sont la moyenne et la déviation standard (écart-type) de l’âge pour ceux qui gagnent plus de 50k par an (caractéristique * salary *) et ceux qui gagnent moins de 50k par an?**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "_uuid": "94d017ce7c339eae7a13a816cee415527837d802"
   },
   "outputs": [],
   "source": [
    "ages1 = data.loc[data[\"salary\"] == \">50K\", \"age\"]\n",
    "ages2 = data.loc[data[\"salary\"] == \"<=50K\", \"age\"]\n",
    "print(\n",
    "    \"The average age of the rich: {0} +- {1} years, poor - {2} +- {3} years.\".format(\n",
    "        round(ages1.mean()),\n",
    "        round(ages1.std(), 1),\n",
    "        round(ages2.mean()),\n",
    "        round(ages2.std(), 1),\n",
    "    )\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "lang": "fr"
   },
   "source": [
    "**6. Est-il vrai que les personnes qui gagnent plus de 50k ont au moins terminé leurs études secondaires? (* education - Bachelors, Prof-school, Assoc-acdm, Assoc-voc, Masters * ou * Doctorat *)**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "_uuid": "8ca42b7a1befd52dbc72781a7c55505c2fb0909c"
   },
   "outputs": [],
   "source": [
    "data.loc[data[\"salary\"] == \">50K\", \"education\"].unique()  # No"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "lang": "fr"
   },
   "source": [
    "**7. Affichez les statistiques d'âge pour chaque race (caractéristique * race *) et chaque sexe (caractéristique * sex *). Utilisez * groupby () * et * describe () *. Trouvez l'âge maximum des hommes de race * Amer-Indian-Eskimo *.**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "_uuid": "f8d3404d2871b68504a3920faa635d723ac2f45b"
   },
   "outputs": [],
   "source": [
    "for (race, sex), sub_df in data.groupby([\"race\", \"sex\"]):\n",
    "    print(\"Race: {0}, sex: {1}\".format(race, sex))\n",
    "    print(sub_df[\"age\"].describe())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "lang": "fr"
   },
   "source": [
    "**8. Parmi ceux qui gagnent plus de 50k, quel est l'état matrimonial (caractéristique marital-status) : marié ou célibataire  le plus représenté ?  \n",
    "On considère comme mariés ceux qui ont un marital-status commençant par Married (Married-civ-spouse, Married-spouse-absent or Married-AF-spouse), les autres sont considérés comme célibataires (bachelors).**  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "_uuid": "0d76dd6753edc9e15a3ecf84d7ea9f44cf497e0d"
   },
   "outputs": [],
   "source": [
    "data.loc[\n",
    "    (data[\"sex\"] == \"Male\")\n",
    "    & (\n",
    "        data[\"marital-status\"].isin(\n",
    "            [\"Never-married\", \"Separated\", \"Divorced\", \"Widowed\"]\n",
    "        )\n",
    "    ),\n",
    "    \"salary\",\n",
    "].value_counts()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "_uuid": "3ebaeb074e6cc61d190ff5c8a1f9e54ece35e166"
   },
   "outputs": [],
   "source": [
    "data.loc[\n",
    "    (data[\"sex\"] == \"Male\") & (data[\"marital-status\"].str.startswith(\"Married\")),\n",
    "    \"salary\",\n",
    "].value_counts()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "_uuid": "6caad66a24cf60a6105f41955b14ba9f25501c87"
   },
   "outputs": [],
   "source": [
    "data[\"marital-status\"].value_counts()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "_uuid": "a100bde4f74a560148f2bf4cc30f2bf6cdc44cac"
   },
   "source": [
    "It's good to be married :)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "lang": "fr"
   },
   "source": [
    "**9. Quel est le nombre d'heures maximum de travail par semaine effectué par une personne (caractéritique * hours-per-week *)?  \n",
    "  Combien de personnes travaillent autant, et quel est le pourcentage de celles qui gagnent beaucoup (>50k) parmi elles?**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "_uuid": "93ac2854df19a96970e330550c0e1985fb38edc4"
   },
   "outputs": [],
   "source": [
    "max_load = data[\"hours-per-week\"].max()\n",
    "print(\"Max time - {0} hours./week.\".format(max_load))\n",
    "\n",
    "num_workaholics = data[data[\"hours-per-week\"] == max_load].shape[0]\n",
    "print(\"Total number of such hard workers {0}\".format(num_workaholics))\n",
    "\n",
    "rich_share = (\n",
    "    float(\n",
    "        data[(data[\"hours-per-week\"] == max_load) & (data[\"salary\"] == \">50K\")].shape[0]\n",
    "    )\n",
    "    / num_workaholics\n",
    ")\n",
    "print(\"Percentage of rich among them {0}%\".format(int(100 * rich_share)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "lang": "fr"
   },
   "source": [
    "**10. Calculez le temps de travail moyen (*hours-per-week*) pour ceux qui sont peu et beaucoup payés (*salary*) pour chaque pays (*native-country*). Qu'obtient-on pour le Japon?**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "_uuid": "fc67334178627c4410c82cf3452b286f11a43732",
    "collapsed": true
   },
   "source": [
    "Simple method:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "_uuid": "6833dda87a7268b668ddce2d94f44abbac46e8e8"
   },
   "outputs": [],
   "source": [
    "for (country, salary), sub_df in data.groupby([\"native-country\", \"salary\"]):\n",
    "    print(country, salary, round(sub_df[\"hours-per-week\"].mean(), 2))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "_uuid": "7ad90f765ec1d95fdd186b220c69b86f5405387a"
   },
   "source": [
    "Elegant method:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "_uuid": "214675c86e5d7eb24fe64850dc5aa2d7bbf95fa8"
   },
   "outputs": [],
   "source": [
    "pd.crosstab(\n",
    "    data[\"native-country\"],\n",
    "    data[\"salary\"],\n",
    "    values=data[\"hours-per-week\"],\n",
    "    aggfunc=np.mean,\n",
    ").T"
   ]
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "hide_input": false,
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.4"
  },
  "nbTranslate": {
   "displayLangs": [
    "*"
   ],
   "hotkey": "alt-t",
   "langInMainMenu": true,
   "sourceLang": "en",
   "targetLang": "fr",
   "useGoogleTranslate": true
  },
  "toc": {
   "base_numbering": 1,
   "nav_menu": {},
   "number_sections": true,
   "sideBar": true,
   "skip_h1_title": false,
   "title_cell": "Table of Contents",
   "title_sidebar": "Contents",
   "toc_cell": false,
   "toc_position": {},
   "toc_section_display": true,
   "toc_window_display": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
